All of the tools that are strictly necessary for clustering are available in base R. For full flexibility, however, the ggdendro, protoclust, and heatmaply packages are recommended. If you want to explore further possibilities, look at the cluster package.
library(tidyverse)
── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.3.2
✔ tibble  2.1.1     ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
── Attaching packages ──────────────────────────────── tidymodels 0.0.2 ──
✔ broom   0.5.1     ✔ recipes   0.1.4
✔ dials   0.0.2     ✔ rsample   0.0.4
✔ infer   0.4.0     ✔ yardstick 0.0.3
✔ parsnip 0.0.1
── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(ggdendro)
library(protoclust)
library(heatmaply)
Loading required package: plotly
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
    last_plot
The following object is masked from 'package:stats':
    filter
The following object is masked from 'package:graphics':
    layout
Loading required package: viridis
Loading required package: viridisLite
Attaching package: 'viridis'
The following object is masked from 'package:scales':
    viridis_pal
======================
Welcome to heatmaply version 0.15.2
Type citation('heatmaply') for how to cite the package.
Type ?heatmaply for the main documentation.
The github page is: https://github.com/talgalili/heatmaply/
Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
Or contact: <tal.galili@gmail.com>
======================
library(spotifyr)
library(compmus)
Attaching package: 'compmus'
The following object is masked from 'package:spotifyr':
    get_playlist_audio_features
source('spotify.R')
The Bibliothèque nationale de France (BnF) makes a large portion of its music collection available on Spotify, including an eclectic collection of curated playlists. The defining musical characteristics of these playlists are sometimes unclear: for example, they have a Halloween playlist. Perhaps clustering can help us organise and describe what kinds of musical selections make it into the BnF's playlist.
We begin by loading the playlist and summarising the pitch and timbre features, just like last week. Note that, also like last week, we use compmus_c_transpose to transpose the chroma features so that, depending on the accuracy of Spotify's key estimation, we can interpret them as if every piece were in C major or C minor. This example also computes delta features for timbre, although the recipe below does not include them in its formula; try adding them yourself if you are feeling comfortable with R!
halloween <-
get_playlist_audio_features('bnfcollection', '1vsoLSK3ArkpaIHmUaF02C') %>%
add_audio_analysis %>%
mutate(
segments =
map2(segments, key, compmus_c_transpose)) %>%
mutate(
segments =
map(
segments,
mutate,
delta_timbre = map2(timbre, lag(timbre), `-`))) %>%
mutate(
pitches =
map(segments,
compmus_summarise, pitches,
method = 'mean', norm = 'manhattan'),
timbre =
map(
segments,
compmus_summarise, timbre,
method = 'mean'),
delta_timbre =
map(
segments,
compmus_summarise, delta_timbre,
method = 'mean')) %>%
mutate(pitches = map(pitches, compmus_normalise, 'clr')) %>%
mutate_at(vars(pitches, timbre, delta_timbre), map, bind_rows) %>%
unnest(pitches, timbre, delta_timbre)
Remember that in the tidyverse approach, we can preprocess data with a recipe. In this case, instead of a label that we want to predict, we start with a label that will make the cluster plots readable. For most projects, the track name will be the best choice (although feel free to experiment with others). The code below uses str_trunc to clip the track name to a maximum of 20 characters, again in order to improve readability. The other change from last week is column_to_rownames, which is necessary for the plot labels to appear correctly.
Last week we also discussed that although standardising variables with step_center (to make the mean 0) and step_scale (to make the standard deviation 1) is the most common approach, step_range is sometimes a better alternative: it squashes or stretches every feature so that it ranges from 0 to 1. For most classification algorithms, the difference is small; for clustering, the differences can be more noticeable. It's wise to try both.
halloween_juice <-
recipe(track_name ~
danceability +
energy +
loudness +
speechiness +
acousticness +
instrumentalness +
liveness +
valence +
tempo +
duration_ms +
C + `C#|Db` + D + `D#|Eb` +
E + `F` + `F#|Gb` + G +
`G#|Ab` + A + `A#|Bb` + B +
c01 + c02 + c03 + c04 + c05 + c06 +
c07 + c08 + c09 + c10 + c11 + c12,
data = halloween) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
# step_range(all_predictors()) %>%
prep(halloween %>% mutate(track_name = str_trunc(track_name, 20))) %>%
juice %>%
column_to_rownames('track_name')
When using step_center and step_scale, the Euclidean distance is the usual choice. When using step_range, the Manhattan distance is also a good option: this combination is known as Gower's distance and has a long history in clustering.
halloween_dist <- dist(halloween_juice, method = 'euclidean')
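Following the document's own convention of commented-out alternatives, a sketch of the Gower-style pairing: if you enable step_range in the recipe above, switch the distance to Manhattan here.

```r
# Alternative, paired with step_range in the recipe above:
# Manhattan distance on 0-1 features yields Gower's distance.
# halloween_dist <- dist(halloween_juice, method = 'manhattan')
```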
As you learned in your DataCamp exercises this week, there are three primary types of linkage: single, average, and complete. Usually average or complete gives the best results. We can use the ggdendrogram function to make a more standardised plot of the results.
hclust(halloween_dist, method = 'single') %>% dendro_data %>% ggdendrogram
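For comparison, the same pipeline works with the other two standard linkages; only the method argument changes.

```r
# Average and complete linkage, for comparison with single linkage above.
hclust(halloween_dist, method = 'average') %>% dendro_data %>% ggdendrogram
hclust(halloween_dist, method = 'complete') %>% dendro_data %>% ggdendrogram
```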
A more recent, and often superior, linkage function is minimax linkage, available in the protoclust package. It is more akin to \(k\)-means: at each step, it chooses an ideal centroid for every cluster such that the maximum distance between centroids and all members of their respective clusters is as small as possible.
protoclust(halloween_dist) %>% dendro_data %>% ggdendrogram
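If you want a flat clustering from a dendrogram, for instance to compare against the four \(k\)-means clusters below, base R's cutree cuts an hclust tree at a chosen number of clusters (for protoclust objects, the analogous function is protocut). The choice of four clusters here is just for the comparison.

```r
# Cut the average-linkage tree into four flat clusters.
hclust(halloween_dist, method = 'average') %>% cutree(k = 4)
```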
Try all four of these linkages. Which one looks the best? Which one sounds the best (when you listen to the tracks on Spotify)? Can you guess which features are separating the clusters?
Unlike hierarchical clustering, \(k\)-means clustering returns a different result every time it is run. Nonetheless, it can be a useful reality check on the stability of the clusters from hierarchical clustering.
kmeans(halloween_juice, 4)
K-means clustering with 4 clusters of sizes 5, 7, 2, 6
Cluster means:
danceability energy loudness speechiness acousticness instrumentalness liveness valence tempo
1 0.64836983 -0.49710006 0.1245002 0.05224442 0.3870865 -0.367575835 0.62657491 0.2517858 -0.57764542
2 -1.20286793 -0.65481004 -0.8180242 -0.69484670 0.1960202 -0.002641033 0.08514692 -1.0319293 -0.13884228
3 0.09605479 0.07213439 0.5262965 -0.48145689 0.1384264 1.105519968 0.23473711 0.1471131 -0.02038562
4 0.83101946 1.15415030 0.6751792 0.92760310 -0.5974045 -0.059112255 -0.69972953 0.9450583 0.65014905
duration_ms C C#|Db D D#|Eb E F F#|Gb G G#|Ab
1 -0.07050269 -0.34809410 0.50342957 -0.4354586 -0.4435755 -1.0830882 -0.6567164 0.4805615 0.47929858 1.0426814710
2 0.36331256 0.03424285 0.12208631 -0.2036450 0.6233577 0.4252545 0.3213209 -0.6423839 -0.09781781 0.0001152723
3 -0.12646175 0.44130883 -1.77309313 1.3525471 -0.5766835 1.7594651 -0.5345336 -0.4481507 0.17806751 -2.0105648218
4 -0.32295850 0.10302548 0.02907237 0.1496190 -0.1653766 -0.1800450 0.3505671 0.4983636 -0.34465054 -0.1988474363
A A#|Bb B c01 c02 c03 c04 c05 c06 c07
1 0.3224897 0.78538950 -0.5982347 -0.6921753 -0.5563510 -0.5489469 -0.06467015 0.2282925 0.8266389 -0.164543931
2 -0.3356205 0.03918216 -0.3162802 -0.2106414 -0.2500977 0.3978360 -0.81726256 0.1383504 -0.8499680 -0.001680086
3 1.2442431 -1.91137400 1.8355507 0.9626355 1.2067481 1.7439574 1.40708521 0.5926782 0.1993832 -0.998983423
4 -0.2919318 -0.06307911 0.2556722 0.5016825 0.3531571 -0.5880053 0.53833638 -0.5492119 0.2363025 0.472074517
c08 c09 c10 c11 c12
1 0.3382651 0.8217280 0.4590651 -0.9741404 -0.2838794
2 0.1770037 0.4068887 -0.6793797 0.9025014 0.6023293
3 -1.4265712 -0.9808208 -0.4151848 0.5479418 -0.2223407
4 -0.0128682 -0.8325366 0.5484503 -0.4237820 -0.3920377
Clustering vector:
I Put a Spell on You Close Your Eyes Evil Woman Time to Kill Evil Bad Woman
4 2 4 1 1
The Princess of Evil 'Round Midnight Tana's Theme - Fr... Evil Blues Someone Is Watching
2 2 3 1 2
Violin Sonata in ... Cannibal Pot Red Devil You're Not Living... Little Demon
3 4 4 2 4
Devil at 4 O'Cloc... Old Devil Moon Up Jumped the Devil Beetween the Devi... Devil Woman
2 4 1 1 2
Within cluster sum of squares by cluster:
[1] 104.86406 139.55752 29.47158 127.08693
(between_SS / total_SS = 37.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault"
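Because kmeans starts from random centroids, set a random seed first if you want reproducible clusters; the seed value below is arbitrary.

```r
set.seed(42)  # any fixed seed makes the k-means result reproducible
kmeans(halloween_juice, 4)
```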
Especially for storyboards, it can be helpful to visualise hierarchical clusterings along with heatmaps of the feature values. We can do that with heatmaply. Although the interactive heatmaps are flashy, think carefully about whether this representation is more helpful for your storyboard than the simpler dendrograms above.
heatmaply(
halloween_juice,
hclustfun = hclust,
# hclustfun = protoclust,
# Comment out the hclust_method line when using protoclust.
hclust_method = 'average',
dist_method = 'euclidean')
Which features seem to be the most and least useful for the clustering? What happens if you re-run this notebook using only the best features?